Congenital idiopathic megaesophagus in the German shepherd dog is a sex-differentiated trait and is associated with an intronic variable number tandem repeat in Melanin-Concentrating Hormone Receptor 2

Congenital idiopathic megaesophagus (CIM) is a gastrointestinal (GI) motility disorder of dogs in which reduced peristaltic activity and dilation of the esophagus prevent the normal transport of food into the stomach. Affected puppies regurgitate meals and water, fail to thrive, and experience complications such as aspiration pneumonia that may necessitate euthanasia. The German shepherd dog (GSD) has the highest disease incidence, indicative of a genetic predisposition. Here, we discover that male GSDs are twice as likely to be affected as females and show that the sex bias is independent of body size. We propose that female endogenous factors (e.g., estrogen) are protective via their role in promoting relaxation of the sphincter between the esophagus and stomach, facilitating food passage. A genome-wide association study for CIM revealed an association on canine chromosome 12 (P-val = 3.12x10-13), with the lead SNPs located upstream or within Melanin-Concentrating Hormone Receptor 2 (MCHR2), a compelling positional candidate gene having a role in appetite, weight, and GI motility. Within the first intron of MCHR2, we identified a 33 bp variable number tandem repeat (VNTR) containing a consensus binding sequence for the T-box family of transcription factors. Across dogs and wolves, the major allele includes two copies of the repeat, whereas the predominant alleles in GSDs have one or three copies. The single-copy allele is strongly associated with CIM (P-val = 1.32x10-17), with homozygosity for this allele posing the most significant risk. Our findings suggest that the number of T-box protein binding motifs may correlate with MCHR2 expression and that an imbalance of melanin-concentrating hormone plays a role in CIM. We describe herein the first genetic factors identified in CIM: sex and a major locus on chromosome 12, which together predict disease state in the GSD with greater than 75% accuracy.


Introduction
Esophageal motility is an integrated neuromuscular process that, when dysregulated, causes an array of digestive disturbances [1]. Normally, the consumption of foods and liquids stimulates afferent signaling of vagus nerve receptors extending from the pharynx to the lower esophageal sphincter (LES), triggering an efferent vagal response comprised of peristaltic contractions and LES relaxation [2,3]. In humans, the most recognized and studied esophageal dysmotility is achalasia [4], characterized by constriction of the LES and aperistalsis, causing difficulty swallowing, coughing, chest pain, and regurgitation [5,6].
The most common esophageal dysmotility in dogs is congenital idiopathic megaesophagus (CIM) [7]. While gravity aids motility of the vertical human esophagus, it does not facilitate food passage in the horizontally-oriented canine esophagus. CIM-affected dogs have ineffective peristalsis, which leads to food retention that stretches and dilates the esophagus [8]. Overt clinical signs include coughing and regurgitation, usually beginning upon weaning at around four weeks of age [3,9]. CIM encompasses a broad phenotypic spectrum ranging from subclinical cases that may only be detected via radiography to severe cases with regurgitation episodes several times a day [7,9]. A CIM diagnosis is confirmed by observation of esophageal dilation on thoracic radiographs, with or without barium contrast [3,10] (Fig 1). Affected puppies fail to thrive and are at risk for aspiration pneumonia [11] and intussusception [12,13].
Neonatal mortality is high, but many CIM cases can be managed with a high-caloric liquid diet, frequent meals, and an elevated feeding regimen wherein dogs are held vertically to facilitate passage of food into the stomach [14]. Recently, administration of sildenafil was shown to ameliorate the clinical signs of both canine CIM and human idiopathic achalasia by promoting relaxation of the LES [15,16]. Most dogs with CIM require lifelong symptomatic management, but 20% to 46% of cases will spontaneously resolve by one year of age, suggesting that the disease may be attributed to delayed nerve development in the esophagus [3,17,18]. An esophagus-specific defect in afferent vagal innervation has been described in CIM-affected dogs [19,20]. CIM occurs across breeds, but the German shepherd dog (GSD) has the highest incidence, followed by Labrador retrievers, Great Danes, dachshunds, and miniature schnauzers [18,21-numbers for the 1,330 dogs of pure and mixed breeds and 54 wild canids used for variant filtering and VNTR genotyping are available in S3 Table. SNP chip data are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad. f7m0cfxz3. 23]. We previously hypothesized that heritable factors underlie the high frequency of CIM in the GSD, and a preliminary study indicated a suggestive region of association on chromosome 12 and a complex pattern of inheritance [24]. We aim herein to conduct a robust genomewide association study (GWAS) to identify genomic regions contributing to CIM and identify genetic variants that can be used as a tool to facilitate breeder efforts to reduce disease incidence.

Study population
We recruited blood or buccal samples from 530 GSDs: 124 CIM-affected (70 males, 54 females) and 406 unaffected (165 males, 241 females) dogs (S1 Table). Samples were obtained primarily from two discrete United States populations: "pet GSDs" representing pets and other privately-owned dogs (108 affected, 303 unaffected) and "service GSDs" from breeding colonies maintained by multiple service organizations (16 affected, 103 unaffected). Genome-wide Phenotypic records were obtained from 755 GSDs (16 are part of the aforementioned study population) from a service organization that maintains a private breeding colony (S2 Table). All dogs (affected and healthy) underwent barium studies at five weeks of age. In this larger cohort with stringent phenotyping, a significant proportion of cases were male (Fig 2A; 109 males, 62 females, P-val = 0.0004). Because GSD adult males are larger than females, we investigated whether body weight is correlated with CIM. Between the sexes, birth weights did not differ significantly (Fig 2B; 346 males, 330 females, P-val = 0.41). Weights of affected individuals were not significantly different from controls at birth (Fig 2C; 92 affected, 584 unaffected, P-val = 0.58), or at adulthood (Fig 2D; 35 affected males, 259 unaffected males, P-val = 0.24).

Genome-wide association study
We conducted a GWAS for CIM, with sex as a covariate, using 59 cases (24 female, 35 male), 53 controls (35 female, 18 male), and 117,451 SNPs, after filtering. A single region of association extending from 56.5 to 60 Mb on chromosome 12 includes 82 SNPs exceeding Bonferroni correction (P-val � 4.26x10 -7 ; Fig 3A and 3B). The lead SNP, chr12:58158449, has a P-val of 3.12x10 -13 and R-squared of 0.43. High linkage disequilibrium (LD) with the lead SNP, defined by r 2 >0.6, extends from 57.3 to 58.4 Mb and contains eight protein-coding genes (Fig 3C), none of which is known to underlie phenotypes in the dog.

Identification of candidate variants
Whole genome resequencing (WGS) data (ranging from 30 to 54X coverage) were generated for three, ancestrally diverse, affected female GSDs that were homozygous for the risk alleles of the leading 10 chromosome 12 SNPs (S3 Table). In the aforementioned 1.1 Mb region of high LD, 1,737 variants were homozygous in all three affected dog genomes. We used a variant call format (VCF) file containing WGS data from 1,330 domesticated dogs of pure and mixed breeds to evaluate non-structural variant allele frequencies. None of the 1,737 variants are unique to the affected dogs or the GSD breed. We generated structural VCF files and manually scanned the three affected GSD genomes in Integrative Genomics Viewer (IGV) to identify mobile elements and large deletions and insertions within the 1.1 Mb interval. All structural variants are present in multiple non-GSD genomes. No variants in the interval are predicted to impact protein sequence. Together, these results indicate that the CIM-associated variant is a non-coding polymorphism found across breeds.
We observed that 85% of affected dogs from the GWAS were homozygous for the risk allele, compared to only 19% of controls, indicating that homozygosity for the chromosome 12 locus is a strong risk factor. We therefore delimited a narrower interval of 648 kb, wherein a maximum number of affected individuals are homozygous for the risk haplotype, defined by four heterozygous individuals to the centromeric end and three telomerically (Fig 4). Filtering to retain homozygous variants present in the three CIM genomes and absent from a publiclyavailable male GSD genome lacking the risk haplotype yielded 577 variants. Of these, 21 were in intronic or untranslated regions of Melanin-Concentrating Hormone Receptor 2 (MCHR2, ENSCAFG00000003533.5), and the remaining were located intergenically to protein-coding genes and outside of promoter regions.
We next selected candidate variants for genotyping in a larger cohort to further assess their association with CIM. Within the 648 kb region of high homozygosity among cases (Fig 4), we identified three intronic variants in positions potentially impacting splicing or regulation of MCHR2: 1) a 4 bp deletion located 20 bp upstream of the exon six splice acceptor site, 2) a SNP in a transcribed region of an antisense transcript (CFRNASEQ_AS_00025246), and 3) a 33 bp variable number tandem repeat (VNTR) in a transcribed region of a non-coding transcript (CFRNA-SEQ_IGNC_00025249), upstream of the MCHR2 translation initiation site. We also identified a compelling structural variant, a long-interspersed nuclear element (LINE) insertion, that lies within an intron of a lincRNA (CFRNASEQ_IGNC_Spliced_00025252) expressed in esophagus and brain. In the larger data set, the 4 bp deletion did not segregate with CIM, and the more distant LINE insertion was less significantly associated than either the antisense SNP or the VNTR. The latter two variants were similarly highly associated (S4 Table), with the differences in P-value appearing to be driven predominantly by genotypic changes among unaffected dogs.
The antisense SNP is not well conserved evolutionarily across mammals, including those expressing MCHR2, and occurs in a transcript that is not annotated in the genomes of other species. Within the 33 bp VNTR are multiple predicted binding motifs, most notably an 8 bp T-half site (TCACACCT; P-val = 3.41x10 -6 from TOMTOM) that matches the optimal consensus binding sequence for T-box family members (Fig 5) [25-28]. We focus the remainder of this study on the VNTR, although the complex inheritance of CIM and high regional LD prevent the exclusion of other linked variants as contributors to CIM.
Together, the VNTR and sex predict disease with 77% accuracy. To identify additional loci involved in CIM, we conducted a second GWAS using the full cohort (59 cases, 53 controls) with sex and VNTR genotypes as covariates. No signals surpassed or approached Bonferroni significance (S3 Fig).

Discussion
Our study reveals a sex bias in CIM and a strong association with a VNTR in MCHR2 on chromosome 12. MCHR2 encodes one of two G-protein coupled receptors for melanin-concentrating hormone (MCH) [29,30], a neuropeptide synthesized in the region of the brain critical for feeding and reward [31,32]. MCH levels are directly correlated with food intake, weight, and gastrointestinal (GI) motility [33][34][35][36][37][38]. MCH is expressed across mammals, but MCHR2 transcripts are only present in dogs, primates, and other higher order members [39,40]. Transgenic mice expressing human MCHR2 have reduced food intake and body weight [40], whereas humans with deletions encompassing MCHR2 and an adjacent gene, Single-Minded Homolog Within intron one and upstream of the translational start site in exon two of MCHR2, we found a 33 bp canid-specific VNTR that contains a T-box binding consensus sequence known as a T-half site. The number of VNTR copies is inversely related to probability of CIM disease: dogs having six total copies are least likely to be affected whereas those with only two total copies have the highest disease incidence. The T-half site is bound by T-box transcription factors [25][26][27][28], and previous studies have illustrated that the number of T-half sites directly correlates with DNA binding [27]. T-box transcription factors can repress or activate target genes [43]. Future studies will be necessary to determine if MCHR2 expression correlates with the number of T-half sites and if through this mechanism the VNTR influences MCH concentrations and GI motility.
GI motility is a key mediator of the sensations of hunger and fullness [44], with accelerated gastric emptying causing a shorter period of satiety and a stronger desire for food [45]. Feeding behaviors are under selection in dogs because food is commonly used as an incentive for positive behavior [46,47,48]. For example, a pro-opiomelanocortin (POMC) mutation associated with hunger and weight in Labrador retrievers has higher frequencies in service dog populations [46,47]. Neurons expressing POMC contribute to satiety signaling via regulation of GI motility [49], and it is worth noting that MCH mediates food intake through inhibition of POMC neuronal activity [50]. We posit that the number of VNTRs is directly related to GI motility: more repeats correlate with higher food motivation and protection from CIM, and fewer repeats correspond with reduced appetite and increased disease risk [51].
Our study reveals a significant sex bias in CIM. Females are affected less often than males and have lower penetrances of the VNTR risk genotypes. Although body size is a fundamental sexually-dimorphic trait, our data illustrate that CIM does not correlate with birth or adult weights. These findings hint at a female protective effect, wherein females have a biological advantage and therefore require a greater number of risk alleles, or genetic liability, to manifest CIM than do males [52].
The female sex hormone, estrogen, plays a role in increasing concentrations of the smooth muscle dilator nitric oxide (NO), which is the major neurotransmitter responsible for relaxing the LES [53,54]. Sex hormones are secreted before and after birth [55,56,57], thus they can impact the development of congenital disorders, like CIM. Higher female sex hormone levels have been linked to decreased LES pressure in pregnant women and postmenopausal women undergoing hormone replacement therapy [54]. Female dogs may have greater LES relaxation due to endogenous factors (e.g., higher estrogen levels), thereby facilitating the passage of food into the stomach and preventing the food retention that causes megaesophagus. We propose that in the absence of this protective effect, males are more susceptible to CIM. Our observations are consistent with male biases in human esophageal disorders, including reflux esophagitis and esophageal cancer, in which estrogen is thought to play a role [58].
In humans, the esophagus is comprised predominantly of smooth muscle, whereas in canids, nearly the entire length of the esophagus is striated muscle [59,60]. The LES is the only component of the canine esophagus dilated by NO [61]. Sildenafil, a drug widely used to treat CIM, reduces LES tone through the prevention of NO degradation [16]. NO levels also directly correlate with MCH levels [62], suggesting that a MCH imbalance may contribute to CIM status by impacting LES pressure.
In summary, we have uncovered a sex bias in CIM and a VNTR, intronic to MCHR2, that is strongly associated with CIM in GSDs. Together, sex and the VNTR predict greater than 75% of disease risk, but it is clear that there are additional factors influencing CIM in the breed. A genetic test is now available to help breeders increase the frequency of the low-risk allele 3. Further studies are warranted to investigate the contribution of the VNTR and sex to CIM in other breeds, as well as gastric dilatation-volvulus (bloat), another GI motility disorder highly prevalent among GSDs [63].

Ethics statement
All samples were obtained with informed consent according to protocols approved by the Clemson University Institutional Review Board (2013-18).

Biologic sample population
Whole blood or buccal cells were obtained from 124 CIM-affected and 406 unaffected privately-owned and service GSDs from across the United States. All cases were diagnosed by a veterinarian via exclusion of non-idiopathic causes (e.g., persistent right aortic arch, myasthenia gravis) and a history of clinical signs from puppyhood, in conjunction with a standard or barium contrast radiograph. Pedigrees and radiographs were collected when available. Among GWAS cases, 95% of dogs were diagnosed at under one year of age. GWAS controls were over one year of age with no history of clinical signs consistent with CIM and no known relatives affected by CIM. Genomic DNA was isolated using the Gentra Puregene DNA Isolation kit (Qiagen). DNA concentration was quantitated by a NanoDrop 1000 spectrophotometer (Thermo Scientific).

Phenotypic data population
Sex and CIM-affection data were collected from 755 affected and unaffected GSDs from a private breeding colony. All dogs underwent a barium swallow examination at five weeks of age. Birth and adult weight were obtained from subsets of 676 and 599 dogs, respectively.

Genome-wide association and LD analyses
Individuals were selected for the association study such that sex and coat color were roughly balanced between cases and controls, and known relatives were excluded. Genome-wide SNP profiles were generated for 114 dogs (60 female, 54 male) using the Illumina CanineHD Bead-Chip, containing 220,853 SNPs (GeneSeek, Inc.). All filtering and statistical analyses were performed using SNP & Variation Suite v8 (SVS, Golden Helix) with chromosome positions in CanFam3.1. Two samples having call rates < 80% were pruned, as were 103,402 markers having < 95% call rates, minor allele frequencies < 0.05, and/or Hardy Weinberg Equilibrium P-values < 0.0004. Combined-sex GWASs for CIM were conducted with sex and VNTR genotypes as covariates, and P-values were calculated using a linear regression following a full vs. reduced model. All 53 controls were used in LD pairwise analyses between the lead SNP (chr12:58158449) and chromosome 12 SNPs, and plotted via LocusZoom [64]. Sex-specific GWASs were conducted using a linear regression following a full model and a marker set identical to the combined-sex GWASs. All chromosome positions are reported in CanFam3.1.

Whole genome resequencing
Three affected GSDs were selected for WGS: a black/tan female with German ancestry (SRR15446412), a white female from the Netherlands (SRR15446416), and a black/tan female from an American service dog breeding colony (SRR15446414). Resequencing of genomes from the latter two dogs was performed using an Illumina HiSeq X Ten, generating 2x150 bp paired-end reads. Total reads generated ranged from 861 to 869 million per sample. Pairedend reads were trimmed, aligned to the indexed reference (CanFam3.1), sorted, and indexed to be viewed in IGV [65] using the Illumina DRAGEN (Dynamic Read Analysis for GENomics) Bio-IT platform [66]. WGS data for the third affected GSD were generated on an Illumina HiSeq 2000, with 2x125 bp paired-end reads. A total of 584 million reads were trimmed, aligned to CanFam3.1 with Bowtie2 [67], and sorted and indexed using SAMtools [68].

Variant filtering
The DRAGEN pipeline was used to generate VCF files for two affected dogs, and SAMtools and BCFtools [69] were used to generate a VCF file for a third case. A VCF containing publicly-available WGS data from 1,330 dogs of pure and mixed breeds and 54 wild canids, generated following the methods described in [70], was used to assess allele frequencies in a broader canid population (S6 Table). Homozygous variants shared across the three affected dogs were selected using SVS. Within a 648 kb region of high homozygosity among cases (chr12: 57,747,395,480), variants present in a male GSD lacking the risk haplotype (SRX4036121) were excluded from further analysis.
Structural VCF files were generated for the three affected GSDs and one GSD lacking the risk haplotype using SvABA [71]. The following settings for SvABA were used: '-r all', '-k chr12:1-72,498,081', and '-p 19'. The presence of alternate structural variants in other breeds was manually investigated in IGV using 15 genomes of nine other breeds (see Data Availability Statement).
Coding and splice site variants within predicted exons plus 50 bp flanking sequences were identified using CanFam3.1 Ensembl 89. Promoter and untranslated regions were defined in hg38 using GENCODEV36 and GeneHancer v5.4 and lifted over in the UCSC Genome Browser to CanFam3.1 positions. Transcription factor binding motifs were identified in TOM-TOM [72].

Genotyping
Primer sequences are given in S7 Table. The PCR for CFRNASEQ_IGNC_Spliced_00025252 g.58216509_58216510ins(6444) used two forward primers, one upstream of and one within the LINE insertion, and a single downstream reverse primer. PCR for MCHR2 g.58084223_58084226del was carried out using 2X ReddyMix (Thermo Scientific), and PCRs for MCHR2 g.58093157T>A, MCHR2 g.58117748_58117780del, chr12.g.58158449A>G, and CFRNASEQ_IGNC_Spliced_00025252 g.58216509_58216510ins(6444) were carried out using Taq DNA Polymerase (Fisher BioReagents). MCHR2 g.58117748_58117780del and CFRNASE-Q_IGNC_Spliced_00025252 g.58216509_58216510ins(6444) PCR products were run on a 3% agarose gel to determine genotypes by size (S4 Fig). Sanger sequencing (Eton Bioscience) was performed for the remaining variants using the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems) and an ABI 3730xl DNA Analyzer (Applied Biosystems). Three cases and two controls from the GWAS population were excluded from variant genotyping due to inadequate DNA quantities.
The VCF of 1,384 canid genomes was used to genotype MCHR2 g.58117748_58117780del. At this positon, the reference allele contained two copies of the repeat and the alternate alleles were denoted as either a 33 bp deletion or insertion, corresponding to one or three copies of the repeat, respectively.

Statistical analyses
Fisher's exact two-tailed P-values were calculated to evaluate allelic and genotypic associations with CIM using VassarStats (http://vassarstats.net/). Because only 33 GSDs possessed the VNTR allele 2, those individuals were excluded from the VNTR allelic association analysis. The genotypic associations of 1/1 and 1/3 with CIM were calculated using 3/3 dogs as a comparison. A one-way chi square test was used to assess the significance of male overrepresentation among cases (http://vassarstats.net/). Two-sample t tests were used to evaluate mean weight differences in males vs. females and affected vs. unaffected dogs. Probability of disease was calculated for each sex by dividing the number of cases having a particular genotype by the total number of dogs with that genotype. Fisher's exact one-tailed P-values were calculated to assess the significance of disease probability differences between the sexes within the 1/1 and 1/3 genotypic groups (http://vassarstats.net/). Our disease phenotype is binary, dogs are either affected or unaffected. The evaluation of such data is typically conducted with logistic regression, where we define the probability of disease for a dog of the i-th sex and j-th genotypic class as p ij . Accordingly, we define the logit of this probability as y ij ¼ log½p ij =ð1 À p ij Þ�where the subsequent analysis is built with the following linear model: where b 0 is an unknown constant common to all dogs, sex i is the contribution of the i-th sex (i = F/M) and genotype j is the contribution of the j-th genotypic (j = 1/1, 1/2, 1/3, 2/3, 3/3) class. Estimation of the unknown effects and predictions of the risk of disease, are provided by the glm function of the public domain language R [73]. The accuracy of this model (and any additional sub-models) in the prediction of disease can be assessed through the receiver operating characteristic curve (using the area under the curve), fitted with the R package pROC [74].

Dyad DOI
https://doi.org/10.5061/dryad.f7m0cfxz3 [75] Supporting information   Table. Candidate variants in chromosome 12 region of high LD. � The MCHR2 g.58117748_58117780 reference allele includes two copies and alternate alleles have one (del) or three (dup) copies. Because the two-copy allele is uncommon in GSDs, individuals having two-copy alleles were excluded from the MCHR2 g.58117748_58117780del statistics. For the remaining variants, A1 corresponds to the alternate allele and A2 to the reference allele.  (2)